channeldb: fix race condition in link node pruning #10462
Conversation
Summary of Changes: Hello @ziggie1984, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical race condition within the link node pruning logic.
Code Review
This pull request correctly identifies and fixes a critical TOCTOU race condition in MarkChanFullyClosed and pruneLinkNode. By moving the link node existence check and deletion into the same atomic database transaction, it prevents a state where a link node could be deleted despite a new channel being opened. The fix in MarkChanFullyClosed is direct and effective. The fix in pruneLinkNode correctly uses a double-check pattern to maintain performance while ensuring correctness. I have one suggestion to improve the implementation of pruneLinkNode for better code clarity and consistency.
Force-pushed from faf8cca to d249c83
This happened out there in the wild and was reported by a user: although a pending OpenChannel flow was going on in the background, we deleted the node from the db.
Force-pushed from 50cad71 to 1186e3e
My node was impacted by this bug. We had to ask the peer to force close the channel and then used chantools to sweep the funds.
```go
// Decide whether we want to remove the link node, based upon the number
// of still open channels.
return c.pruneLinkNode(openChannels, pruneLinkNode)
err = deleteLinkNode(tx, remotePub)
```
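For context, the shape of the fix quoted above is roughly the following. This is a minimal sketch rather than the actual patch: fetchOpenChannels is assumed to be a transaction-scoped helper like the deleteLinkNode call shown in the diff, and lnd's kvdb.Update signature with a trailing reset callback is assumed.

```go
// Sketch: perform the check and the delete in ONE write transaction, so
// no channel can be opened between reading the channel set and deleting
// the link node.
err := kvdb.Update(c.backend, func(tx kvdb.RwTx) error {
	// Re-read the peer's open channels inside this transaction; the
	// result cannot be invalidated by a concurrent channel open.
	openChannels, err := fetchOpenChannels(tx, remotePub)
	if err != nil {
		return err
	}

	// Only when no channels remain is it safe to drop the link node.
	if len(openChannels) > 0 {
		return nil
	}

	return deleteLinkNode(tx, remotePub)
}, func() {})
```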
Yup, we should have been doing this in the same db transaction the entire time.
Do we need any sort of reconciliation logic on start up to handle instances that might've happened in the wild? So sanity check that for each open channel, we have a link node present.
Added a fix at startup. Not sure if it is worth it, but I think it does not cost much either, and it can be removed when moving to SQL native?
> not sure if it is worth it
My thinking is that there may be nodes in an anomalous state that have a channel, but not an active LinkNode.
E.g., on startup, we use this to determine who we need to connect out to:
Lines 3489 to 3495 in 801de79
```go
// Iterate through the list of LinkNodes to find addresses we should
// attempt to connect to based on our set of previous connections. Set
// the reconnection port to the default peer port.
linkNodes, err := s.chanStateDB.LinkNodeDB().FetchAllLinkNodes()
if err != nil && !errors.Is(err, channeldb.ErrLinkNodesNotFound) {
	return fmt.Errorf("failed to fetch all link nodes: %w", err)
}
```
Ahh, you are right, I did not see that. But this is now fixed with the repair function at the beginning of the server startup.
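For illustration, a first cut of such a startup repair pass might look like the sketch below. The helper names (fetchAllOpenChannels, FetchLinkNode, CreateLinkNode) and the ErrNodeNotFound sentinel are illustrative assumptions, not the PR's exact code. Note that each linkNodeDB call here opens its own transaction, which is exactly what the review further down suggests aggregating.

```go
// Sketch: ensure every peer we have an open channel with also has a
// link node, recreating any that were lost to the old race.
channels, err := c.fetchAllOpenChannels()
if err != nil {
	return err
}

for _, channel := range channels {
	pub := channel.IdentityPub

	_, err := c.linkNodeDB.FetchLinkNode(pub)
	switch {
	case err == nil:
		// Link node already present, nothing to repair.
		continue

	case errors.Is(err, ErrNodeNotFound):
		// Anomalous state: open channel without a link node.
		if err := c.linkNodeDB.CreateLinkNode(pub); err != nil {
			return err
		}

	default:
		return err
	}
}
```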
yyforyongyu left a comment
🛠️
| "from database", | ||
| remotePub.SerializeCompressed()) | ||
|
|
||
| return c.linkNodeDB.DeleteLinkNode(remotePub) |
Guess this DeleteLinkNode can now be deleted?
It is still used in tests so that we can add and remove link nodes; I think we can keep it.
@ziggie1984, remember to re-request review from reviewers when ready
channeldb/db.go (outdated):

```go
// Link node doesn't exist, so we need to create it.
// We create it within a transaction to ensure consistency.
err = kvdb.Update(c.backend, func(tx kvdb.RwTx) error {
```
Why not aggregate the DB transactions? We can call into linkNodeDB in the same transaction. This'll speed up startup time, if we are calling this on startup each time. So the first loop can make a set of the nodes we're connected to and check if a link node exists or not. Then one db txn makes all the link nodes, exiting early if that's not needed.
Good point, I did not pay attention that every linkNodeDB call was creating its own transaction. Now only two transactions are used. I also enhanced the linkNodeDB abstraction with the appropriate functions.
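A minimal sketch of the aggregated shape described here, assuming lnd's kvdb.View/kvdb.Update signatures with a reset callback and the same illustrative helpers as above (fetchAllOpenChannels, linkNodeExists, createLinkNode are stand-ins, not the PR's exact code):

```go
// Transaction 1 (read): collect peers that have an open channel but no
// link node.
var missing []*btcec.PublicKey
err := kvdb.View(c.backend, func(tx kvdb.RTx) error {
	channels, err := fetchAllOpenChannels(tx)
	if err != nil {
		return err
	}

	for _, channel := range channels {
		pub := channel.IdentityPub
		if !linkNodeExists(tx, pub) {
			missing = append(missing, pub)
		}
	}

	return nil
}, func() {
	missing = nil
})
if err != nil {
	return err
}

// Early exit: nothing to repair, so no write transaction is needed.
if len(missing) == 0 {
	return nil
}

// Transaction 2 (write): create all missing link nodes in one batch.
return kvdb.Update(c.backend, func(tx kvdb.RwTx) error {
	for _, pub := range missing {
		if err := createLinkNode(tx, pub); err != nil {
			return err
		}
	}

	return nil
}, func() {})
```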
This commit fixes a critical race condition in MarkChanFullyClosed and
pruneLinkNode where link nodes could be incorrectly deleted despite
having pending or open channels. The race occurred because the check for
open channels and the link node deletion happened in separate database
transactions:

Thread A: TX1 checks open channels → [] (empty)
Thread A: TX1 commits
Thread B: Opens new channel with same peer
Thread A: TX2 deletes link node (using stale data)
Result: Link node deleted despite a pending channel existing

This creates a TOCTOU (time-of-check to time-of-use) vulnerability where
database state changes between reading the channel count and deleting
the node.

Fix for MarkChanFullyClosed:
- Move link node deletion into the same transaction as the channel
  closing check, making the check-and-delete operation atomic.

Fix for pruneLinkNode:
- Add a double-check within the write transaction to verify no channels
  were opened since the caller's initial check.
- Maintain performance by keeping the early return for the common case.
- Prevent deletion if channels exist at delete time.

This ensures the invariant "a link node exists iff channels exist" is
never violated, preventing database corruption and potential connection
issues.
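The pruneLinkNode half of the fix is the classic double-check pattern. A sketch under the same assumptions as above (illustrative helper names, not the PR's exact signature):

```go
// Sketch: the caller's earlier cheap check allows an early return, but
// the authoritative check is repeated inside the write transaction.
func (c *ChannelStateDB) pruneLinkNode(openChannels []*OpenChannel,
	remotePub *btcec.PublicKey) error {

	// Fast path: the caller already saw open channels, so the link
	// node must stay and no write transaction is needed.
	if len(openChannels) > 0 {
		return nil
	}

	return kvdb.Update(c.backend, func(tx kvdb.RwTx) error {
		// Double-check inside the transaction: a channel may have
		// been opened since the caller's snapshot was taken.
		channels, err := fetchOpenChannels(tx, remotePub)
		if err != nil {
			return err
		}
		if len(channels) > 0 {
			// A new channel appeared; keep the link node.
			return nil
		}

		return deleteLinkNode(tx, remotePub)
	}, func() {})
}
```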
Force-pushed from eb5c070 to 270fc8d
/gemini review
Code Review
This pull request effectively addresses a critical race condition in link node pruning by ensuring that channel checks and node deletions occur within the same atomic database transaction. The introduction of RepairLinkNodes is a thoughtful addition to correct any database inconsistencies that may have resulted from the previous bug. The changes are well-structured, and the new tests are comprehensive. I have one suggestion to refactor a new function for improved readability.
We make sure that nodes previously suffering from this error will have a consistent db view when restarting their node.
Force-pushed from 270fc8d to d9fb909
Roasbeef left a comment
LGTM 🥨
Successfully created backport PR for v0.20.x-branch:
[v0.20.x-branch] Backport #10462: channeldb: fix race condition in link node pruning